Abstract


Human-level visual performance has remained largely beyond the reach of engineered
systems despite decades of research and significant advances in problem formulation,
algorithms and computing power. We posit that significant progress can be made by
combining existing technologies from machine vision, insights from theoretical
neuroscience and large-scale distributed computing. Such claims have been made before, and
so it is quite reasonable to ask what new ideas we bring to the table that
might make a difference this time around. From a theoretical standpoint, our primary
point of departure from current practice is our reliance on exploiting time in order to
turn an otherwise intractable unsupervised problem into a locally semi-supervised, and
plausibly tractable, learning problem. From a pragmatic perspective, our system
architecture follows what we know of cortical neuroanatomy and provides a solid foundation
for scalable hierarchical inference. This combination of features provides the framework
for implementing a wide range of robust object-recognition capabilities.
Final Report


Funding for this grant was cut off a few months after it began, with the consequence that
work had barely begun on the core problems described in the statement of work. Nonetheless,
work has proceeded on related problems funded by other, non-governmental sources, and
so it makes sense for the final report for this award to provide a broader survey of the
related work and the prospects for its success in the coming years. We start with a little
history, which sets the stage for the current resurgence of interest in brain-like computing
architectures.

In July of 2005, Tom Dean, the principal investigator for this grant, presented a paper
at AAAI entitled “A Computational Model of the Cerebral Cortex” [5]. The paper
described a graphical model of the visual cortex inspired by David Mumford’s computational
architecture [21, 22, 17]. At that same meeting, Jeff Hawkins gave an invited talk entitled
“From AI Winter to AI Spring: Can a New Theory of Neocortex Lead to Truly Intelligent
Machines?” drawing on the content of his popular book On Intelligence [11]. A month
later at IJCAI, Geoff Hinton gave his Research Excellence Award lecture entitled “What
kind of a graphical model is the brain?”

In all three cases, the visual cortex is cast in terms of a generative model in which
ensembles of neurons are modeled as latent variables. All three of the speakers were optimistic
regarding the prospects for realizing useful, biologically-inspired systems. In the
intervening two years, we have learned a great deal about both the provenance and the practical
value of those ideas. The history of the most important of these is both interesting and
helpful in understanding whether these ideas are likely to yield progress on some of the
most significant challenges facing AI.

Are we poised to make a significant leap forward in understanding computer and
biological vision? If so, what are the main ideas that will fuel this leap forward and are they
new or recycled? How important is the role of Moore’s law in our pursuit of human-level
perception? The full story delves into the role of time, hierarchy, abstraction, complexity,
symbolic reasoning, and unsupervised learning. It owes much to insights that have been
discovered and forgotten at least once every other generation for many decades.

Since J.J. Gibson [9] presented his theory of ecological optics, scientists have followed his
lead by trying to explain perception in terms of the invariants that organisms learn. Peter
Foldiak [7] and more recently Wiskott and Sejnowski [33] suggest that we learn invariances
from temporal input sequences by exploiting the fact that sensory input tends to vary
quickly while the environment we wish to model changes gradually. Gestalt psychologists
and psychophysicists have long studied spatial and temporal grouping of visual objects and
the way in which the two operations interact [15], and there is certainly ample evidence to
suggest concrete algorithms.

The idea of hierarchy plays a central role in so many disciplines that it is fruitless to trace
its origins. Even as Hubel and Wiesel unraveled the first layers of the visual cortex, they
couldn’t help but posit a hierarchy of representations of increasing complexity [13].
However, direct evidence for such hierarchical organization is slim to this day, and machine
vision has yet to actually learn hierarchies of more than a couple of layers despite compelling
arguments for their utility [31].

Horace  Barlow  [1]  pointed  out  more  than  forty  years  ago  that  strategies  for  coding 
visual  information  should  take  advantage  of  the  statistical  regularities  in  natural  images. 
This  idea  is  the  foundation  for  work  on  sparse  representations  in  machine  learning  and 
computational  neuroscience  [23]. 

By the time information reaches the primary visual cortex (V1), it has already gone
through several stages of processing in the retina and lateral geniculate. Following the
lead of Hubel and Wiesel, many scientists believe that the output of V1 can be modeled
as a tuned band-pass filter bank. The component features, or basis, for this representation
are called Gabor filters. Mathematically, a Gabor filter is a two-dimensional Gaussian
modulated by a complex sinusoid; such filters are tuned to respond to oriented dark bars
against a light background (or, alternatively, light bars against a dark background). The story of
why scientists came to this conclusion is fascinating, but the conclusion may have been
premature; more recent work suggests Gabor filters account for only about 20% of the
variance observed in the output of V1 [24].
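
To make the definition of a Gabor filter concrete, the following is a minimal sketch in Python of a two-dimensional Gaussian envelope multiplied by a complex sinusoid. The function names and the parameters (size, sigma, wavelength, theta, gamma) are illustrative assumptions and are not taken from any model described in this report.

```python
import cmath
import math

def gabor_kernel(size=21, sigma=4.0, wavelength=8.0, theta=0.0, gamma=1.0):
    """Return a size x size complex Gabor kernel as a list of rows.

    Assumes an odd size so the kernel has a well-defined center pixel.
    All parameter names and defaults are illustrative.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the sinusoid varies along direction theta.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            # Two-dimensional Gaussian envelope (gamma sets its aspect ratio).
            envelope = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
            # Complex sinusoid: the real part is even-symmetric (bar-like),
            # the imaginary part is odd-symmetric (edge-like).
            carrier = cmath.exp(2j * math.pi * x_r / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel

def filter_response(patch, kernel):
    """Inner product of an image patch with a kernel: one filter-bank output."""
    return sum(p * k
               for patch_row, kernel_row in zip(patch, kernel)
               for p, k in zip(patch_row, kernel_row))
```

A filter bank in this sense is simply a collection of such kernels at several orientations and wavelengths. A patch of vertical stripes, for example, should produce a much larger response magnitude from the kernel with theta = 0 than from the kernel with theta = pi / 2.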

We’re sure that temporal and spatial invariants, hierarchy, levels of increasingly abstract
features, unsupervised learning of image statistics and other core ideas that have been
floating around for decades must be part of the answer, but machine vision still falls far
short of human capability in most respects. Are we on the threshold of a breakthrough and,
if so, what will push us through the final barriers?

Temporal  and  Hierarchical  Structure 

The  primate  cortex  serves  many  functions  and  most  scientists  would  agree  that  we've 
discovered  only  a  small  fraction  of  its  secrets.  Let’s  suppose  that  our  goal  is  to  build  a 
computational  model  of  the  ventral  visual  pathway,  the  neural  circuitry  that  appears  to  be 
largely  responsible  for  recognizing  what  objects  are  present  in  our  visual  field.  A  successful 
model  would,  among  other  things,  allow  us  to  create  a  video  search  platform  with  the  same 
quality  and  scope  that  Google  and  Yahoo!  provide  for  web  pages.  Do  we  have  the  pieces  in 
place  to  succeed  in  the  next  two  to  five  years? 

In many areas of science and engineering, time is so integral to the description of the
central problem that it can’t be ignored. Certainly this is the case in speech
understanding, automated control, and most areas of signal processing. By contrast, in most areas of
machine vision time has been considered a distraction, a complicating factor that we can
safely ignore until we’ve figured out how to interpret static images. The prevailing wisdom
is that time will only make the problem of recognizing objects and understanding scenes
more difficult. A similar assumption has influenced much of the work in computational
neuroscience, but now that assumption is being challenged. There are a number of proposals
that suggest time is an essential ingredient in explaining human perception [7, 6, 12, 33].
The common theme uniting these proposals is that the perceptual sequences we experience
provide essential cues that we exploit to make critical discriminations. Consecutive
samples in an audio recording or frames in a video sequence are likely to be examples of the
same pattern undergoing changes in illumination, position, orientation, etc. The examples
provide exactly the variation required to train models able to recognize patterns invariant
with respect to the observed transformations.
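
One way to make the notion of a slowly varying feature measurable is the slowness objective used in slow feature analysis (Wiskott and Sejnowski [33]): normalize a feature to zero mean and unit variance, then take the mean squared change between consecutive time steps. The sketch below only scores a given signal and does not learn features; the function name and the synthetic signals are illustrative assumptions.

```python
import math

def slowness(signal):
    """Mean squared step-to-step change of a signal after normalizing it to
    zero mean and unit variance. Smaller values indicate a slower feature."""
    n = len(signal)
    mean = sum(signal) / n
    variance = sum((s - mean) ** 2 for s in signal) / n
    if variance == 0:
        raise ValueError("a constant signal carries no information")
    std = math.sqrt(variance)
    z = [(s - mean) / std for s in signal]
    # Squared differences between consecutive normalized samples.
    diffs = [(b - a) ** 2 for a, b in zip(z, z[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical signals: a slowly changing underlying cause and a rapidly
# fluctuating raw sensor value.
slow_cause = [math.sin(2 * math.pi * t / 200) for t in range(400)]
fast_sensor = [math.cos(2 * math.pi * t / 5) for t in range(400)]
print(slowness(slow_cause), slowness(fast_sensor))
```

Under this measure the slowly changing signal scores far lower than the rapidly fluctuating one, which is the sense in which features of the environment are expected to vary more gradually than raw sensory input.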

There  is  an  even  more  compelling  explanation  when  it  comes  to  learning  hierarchies  of 
spatial  and  temporal  features.  Everyone  agrees,  despite  a  lack  of  direct  evidence,  that  the 
power  of  the  primate  cortex  to  learn  useful  representations  owes  a  great  deal  to  its  facility  in 
organizing  concepts  in  hierarchies.  Hierarchy  is  used  to  explain  the  richness  of  language  and 
our  extraordinary  ability  to  quickly  learn  new  concepts  from  only  a  few  training  examples. 

It seems likely that learning such a hierarchical representation from examples (input and
output pairs) is at least as hard as learning polynomial-size circuits in which the
subconcepts are represented as bounded-input boolean functions. Kearns and Valiant showed
that the problem of learning polynomial-size circuits (in Valiant’s probably approximately
correct learning model) is infeasible given plausible cryptographic limitations [14].

However,  if  we  are  provided  access  to  the  inputs  and  outputs  of  the  circuit’s  internal 
subconcepts,  then  the  problem  becomes  tractable  [27].  This  implies  that  if  we  had  the 
“circuit  diagram”  for  the  visual  cortex  and  could  obtain  labeled  data,  inputs  and  outputs, 
for  each  component  feature,  robust  machine  vision  might  become  feasible.  Instead  of  labels, 
we  have  knowledge  based  on  millennia  of  experience  summarized  in  our  genes  enabling 
us  to  transform  an  otherwise  intractable  unsupervised  learning  problem  (one  in  which  the 
training  data  is  unlabeled)  into  a  tractable  semi-supervised  problem  (one  in  which  we  can 
assume that consecutive samples in a time series are more likely than not to have the same
label). 

This property, called temporal coherence, is the basis for the optimism of several
researchers that they can succeed where others have failed.1 If we’re going to make progress,
temporal coherence has to provide some significant leverage. It is important to realize,
however, that exploiting temporal coherence does not completely eliminate complexity in
learning hierarchical representations. Knowing that consecutive samples are likely to have
the same label is helpful, but we are still left with the task of segmenting the time series
into subsequences having the same label, a problem related to learning HMMs [8].

1 Or, at least, it is one half of the basis for betting on the success of this approach, and the half on
which we have concentrated. The other half relies on the fact that machines, like humans, can intervene
in the world to resolve ambiguity and distinguish cause from correlation. Intervention can be as simple as
exploiting parallax to resolve accidental alignments between distinct but parallel lines.
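
The segmentation task mentioned above, splitting a time series into subsequences that share a label, can be illustrated with a deliberately naive stand-in. The sketch below does not learn an HMM; it simply starts a new segment whenever consecutive feature vectors differ by more than a threshold. The function name, the threshold, and the toy data are assumptions made for illustration.

```python
import math

def segment_by_coherence(frames, threshold):
    """Split a sequence of feature vectors into runs of consecutive frames
    whose frame-to-frame Euclidean distance stays at or below threshold.

    Returns a list of (start, end) index pairs with end exclusive.
    """
    if not frames:
        return []
    segments = []
    start = 0
    for i in range(1, len(frames)):
        # A large jump between neighbors is taken as a likely label change.
        if math.dist(frames[i - 1], frames[i]) > threshold:
            segments.append((start, i))
            start = i
    segments.append((start, len(frames)))
    return segments

# Hypothetical two-dimensional features for six consecutive frames.
frames = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
print(segment_by_coherence(frames, threshold=1.0))
```

A fixed threshold is fragile. It fails when the same object changes appearance abruptly, or when two different objects look alike, which is one reason the problem calls for a learned sequence model instead.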

Learning  Invariant  Features 

It’s hard to come up with a trick that nature hasn’t already discovered. And, while
nature is reluctant to reveal its tricks, decades of machine vision researchers have come up
with their own. Whether or not we have found neural analogs for the most powerful of
these, a pragmatic attitude dictates adapting and adopting them, where possible. David
Lowe [18] has developed an effective algorithm for extracting image features called SIFT
(for scale invariant feature transform). The algorithm involves searching in scale space for
features in the form of small image patches that can be reliably recovered in novel images.
These features are used like words in a dictionary to categorize images, and each image is
summarized as an unordered collection of such picture words.
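
The summary of an image as an unordered collection of picture words can be sketched as a histogram over a fixed vocabulary. The example below assumes the feature descriptors and the vocabulary have already been computed by some other means (it does not implement SIFT), and it uses toy two-dimensional descriptors in place of real ones; all names are illustrative.

```python
import math
from collections import Counter

def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary entry closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))

def bag_of_words(descriptors, vocabulary):
    """Histogram of picture words: how often each vocabulary entry is the
    nearest match to a descriptor. The order of descriptors is ignored."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(vocabulary))]

# Hypothetical vocabulary of three picture words and four patch descriptors.
vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9)]
print(bag_of_words(descriptors, vocabulary))
```

Because the histogram discards where each patch occurred, it is unchanged when the same patches appear in a different arrangement. This is the source of both the invariance and the loss of selectivity of such representations.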

The basic idea of using an unordered collection of features to support invariance has
been around for some time. It even has a fairly convincing story in support of its biological
plausibility [26].2 However, balancing invariance (which encourages false positives) against
selectivity (which encourages false negatives) requires considerable care to get right. For
instance, one approach argues that overlapping features which correspond to the
receptive fields of cortical neurons avoid false positives [31]; another approach provides greater
selectivity by taking into account geometric relationships among the picture words [30].

Lowe’s trick of searching in scale space has an analog in finding spatiotemporal features
in video [16]. Finding corresponding points in consecutive frames of a video is relatively
easy for points associated with patches that stand out from their surrounding context.
We can exploit this fact to track patches across multiple frames. This method identifies
features that are persistent in time. In addition, it learns to account for natural variation in
otherwise distinctive features. There are cells in the retina, lateral geniculate and primary
visual cortex whose receptive fields span space and time and are capable, in theory, of
performing this sort of tracking [4]. Exactly how these spatiotemporal receptors are used
to extract shape, distinguish figure from ground and infer movement is unknown, but
clearly here is another place where time plays a key role in human perception.

Why  the  ventral  visual  pathway? 

Our focus is on V1 and the ventral visual pathway, that part of V2 responsible for
identifying what is in the visual field. The motivation is that this seems to be the sweet
spot for relatively uncontested knowledge concerning brain function and understanding
of what is being represented and how it is being computed. By way of contrast, the
dorsal visual pathway, responsible for tracking the location and motion of objects in
our visual field [10], is less well-studied and presents a more complicated picture. As
an example, positional information appears to be affected by motion [32] and determined
relative to a primary object [2]. Saccades and head movements tend to change the point of
view and must somehow be integrated with the retinotopically mapped information flowing
through the lateral geniculate nuclei. Both of these present us with problems in how to
integrate disparate information sources, including mixing retinotopic with non-retinotopic
information, a challenging problem for which we presently lack any clear solution (see Rolls
and Stringer [28] for an interesting start).

2 Serre et al. [20] describe a system achieving state-of-the-art performance in object recognition by
combining biological clues and techniques from machine vision.

While we believe V1 and the ventral pathway provide useful clues for developing artificial
perceptual systems, we are neither so naive nor so ignorant as to be unaware that the brain
still holds many secrets, and our model does not even account for all the currently extant
data. As mentioned above, Olshausen and Field [24] give the somewhat pessimistic estimate
that we presently understand only about 20% of V1’s functional behavior. We simply don’t
know what the other 80% of the computation is, whether it is important, or what it might
be useful for. On the other side of the coin, we know that attention plays a significant
role in visual perception [25], but we are assuming we can make progress without detailed
understanding of human attentional mechanisms. A similar situation pertains at the level of
neuroanatomy. Our model incorporates a version of feedforward and feedback connections,
but does not presently include lateral connections [19]. Again, we believe we can achieve
some modicum of success without lateral connections, but we await experimental results
before venturing a definitive answer. The takeaway is that we expect to adapt our
model in response to shortcomings exposed by experimentation, but we are aware both of
the gaps in our knowledge of the brain and the discrepancies between our model and what
is presently known about the visual cortex.

Will  big  ideas  or  big  iron  win  the  race? 

Compared with previous approaches, we have one other advantage that allows us
to consider models of realistic scale: increased computing power and the wherewithal to
take advantage of it. The human primary visual cortex (V1) consists of approximately 10^9
neurons and 6 x 10^9 connections [3]. Whereas we have relatively poor data for modeling
individual neurons, despite the press for the IBM / EPFL Blue Brain Project, we are
better positioned with respect to the aggregate behavior of thousands of neurons. Most
of the serious computational brain models aim at the level of a cortical hypercolumn, a
structural unit consisting of 10^4 to 10^5 neurons [20]. If we assign one processor per
hypercolumn, a computing cluster with 10^3 processor cores and accompanying communications
capacity can simulate on the order of 10^8 neurons. This would be about 10% of V1, and
somewhat beyond the reach of most academic labs, but Google and several other industrial
labs can field resources at this scale and beyond. Working at a smaller scale would risk
confounding effects introduced by the scale with effects of the model itself. Working at
a realistic scale allows us to focus on the model. Moreover, deploying resources at scale
allows us to turn around experiments in minutes or hours, as opposed to days or weeks.
This means we can iterate over alternatives, which will surely be necessary, and explore the
space of solutions in a way not practical for an under-resourced system.
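
The arithmetic behind the scale estimate can be checked directly. The sketch below assumes the upper end of the hypercolumn range (10^5 neurons) and a cluster of 10^3 cores, the combination consistent with a total of about 10^8 simulated neurons; the function and parameter names are illustrative.

```python
def simulated_fraction(cores, neurons_per_hypercolumn, v1_neurons):
    """Neurons simulated with one hypercolumn per core, and the fraction
    of V1 that this represents."""
    simulated = cores * neurons_per_hypercolumn
    return simulated, simulated / v1_neurons

# Assumed values: 10^3 cores, 10^5 neurons per hypercolumn, 10^9 neurons in V1.
simulated, fraction = simulated_fraction(10**3, 10**5, 10**9)
print(simulated, fraction)
```

With hypercolumns at the lower end of the range (10^4 neurons), the same 10^8 neurons would require ten times as many cores.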

Simulating the brain to achieve human-level sensory and cognitive abilities is just
starting to make the transition from tantalizing possibility to practical, engineered reality.
Making that transition will take both good ideas, including venerable old ideas and some new
ones, and heavy-duty computing power. While we believe the inclusion of time is an
essential element of a solution, and our model offers promise of success, we are aware that
more may be necessary. The extant data on brain function is notably sparse, and our model
makes no attempt to take all known brain function into account, e.g., attentional
mechanisms. What we do have is a plausible model (biologically inspired, though certainly not
biologically accurate) and the tools to evaluate and improve it. While many of the ideas
have been around for some time, the infrastructure to quickly and convincingly evaluate
them has been lacking. Robust, high-performance distributed computing hardware and
software don’t make you smarter, but they do allow you to reach a little further, and that
could make the crucial difference.